[CI] Build the test image once and pull it from ECR#5905
Conversation
build.yml rebuilt the test image independently in both test-isaaclab-tasks and test-general (~23 min each, run in parallel and racing the GHA layer cache). Add a single build job that builds the image and pushes it to ECR via the ecr-build-push-pull action (ported from develop), and have the test jobs depend on it and pull the prebuilt image instead of rebuilding. The action resolves the ECR URL from the runner's SSM ecr-cache-url parameter and falls back to a local build when ECR is unavailable, so runners without ECR keep working as before. Preserves main's image (docker/Dockerfile.curobo), test execution, and result handling.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot — PR #5905
Summary
This PR eliminates a redundant ~23-min Docker image build by introducing a dedicated build job that pushes to ECR once, then has both test-isaaclab-tasks and test-general pull the prebuilt image. A new composite action (.github/actions/ecr-build-push-pull/action.yml) encapsulates the build/push/pull logic with a multi-level caching strategy (exact commit tag → deps hash → full build).
🧪 Isaac Lab Expert
Architecture: ✅ Correct build-once/pull pattern
- The job dependency graph (
build → {test-isaaclab-tasks, test-general} → combine-results) ensures the image is always available before test jobs start. - The
cache-tag: cache-main-curobocorrectly namespaces main's layer cache away from develop/beta2. - The action is ported from the already-proven
developworkflow — good reuse.
Race condition risk: ✅ None identified
- The
needs: [build]dependency in both test jobs provides a hard barrier — no test job starts until the build job succeeds. - The previous approach had both test jobs building in parallel racing the GHA layer cache; this eliminates that race entirely.
Runner affinity consideration:
- The
buildjob and test jobs all use[self-hosted, gpu]runners. If the test jobs land on a different runner than the build job, they will pull from ECR (not local cache). This is by design and the action handles it correctly viadocker manifest inspect→ pull → tag. However, if the ECR push in the build job fails silently (see below), the test jobs would fall back to a full local build — which is actually the correct degraded behavior.
🔇 Silent Failure Hunter
1. ECR push failure in build job (Step 7) — Medium risk
- If
docker pushfails in the build job's "Push to ECR" step, it will cause the build job to fail (non-zero exit), which would block test jobs entirely due toneeds: [build]. - ✅ This is correct behavior — a hard failure is better than a silent one.
2. ECR unavailability in test jobs — ✅ Safe
- When test jobs invoke the action and ECR is unavailable, the action resolves no ECR URL and falls back to a full local build (Step 6). The
|| trueguards on IMDS/SSM calls prevent hard failures during resolution. - However, since the
buildjob already pushed the image, the test jobs should always find it via the commit-tagged ECR image (Step 4 manifest check → pull).
3. Docker credential store bypass — ✅ Properly handled
- Step 1 creates a temp config with
credsStore: ""and cleans it up in Step 9 (if: always()).
4. Potential gap: docker buildx imagetools create failure in deps-cache step (Step 5)
- If
docker buildx imagetools create -t "${ECR_IMAGE}" "${DEPS_ECR_IMAGE}"fails, the step would exit non-zero and the build job would fail. This is safe (fails loud).
5. || true patterns in resolution steps — Acceptable
- The
|| trueguards on IMDS token, instance ID, region, and SSM are intentional — they allow graceful degradation on non-EC2 runners.
📊 Test Coverage Analyzer
- No new tests needed. This is a CI-infrastructure-only change. The existing CI run of this PR serves as the integration test (build → pull → test pattern).
- The test plan in the PR description is appropriate: validate YAML, observe the green CI run.
🏁 Verdict: LGTM ✅
Well-structured CI optimization that follows the proven pattern from develop. The build-once/pull architecture is correct, race conditions are eliminated by job dependencies, and silent failures are properly handled with graceful degradation.
Minor suggestions (non-blocking):
- Consider adding a brief comment in
build.ymlnoting thatecr-build-push-pullin test jobs will effectively just pull (since build already pushed), to aid future maintainers reading the workflow. - The
timeout-minutes: 180on thebuildjob matches the test jobs — this is generous for a single build. A tighter timeout (e.g., 60 min) could catch hangs earlier, though this is cosmetic given the existing test job timeouts.
Reviewed at commit: 7c22ca8
Update (c3c195c): New push is a merge from main into the feature branch (nightly-changelog workflow improvements, daqp bump to 0.8.5, version bump to 0.54.4). The PR's own files (.github/actions/ecr-build-push-pull/action.yml and .github/workflows/build.yml) are unchanged. Previous review still applies — LGTM holds.
Re-validated at: c3c195c | 2026-06-02T18:14 UTC
Update (49388a8): Two targeted improvements to the composite action:
-
🔒 Security fix (NGC credential leak): Changed
docker login -p $NGC_API_KEYto pipe the key via--password-stdin, preventing the secret from appearing in/proc/<pid>/cmdlineon shared self-hosted runners. Good catch — this was a real exposure vector on multi-tenant machines. -
📋 Observability improvement (cache fallback logging): Added explicit log messages when ECR cache is available vs unavailable. The
elsebranch now logs🟠 ECR unavailable: full local build with no shared cache (build-once degrades to build-per-job)— making degradation immediately visible in CI output rather than requiring log archaeology.
Both changes are non-functional (no logic change), improve security posture and debuggability. LGTM still holds. ✅
Re-validated at: 49388a8 | 2026-06-05T21:28 UTC
Greptile SummaryThis PR eliminates the duplicate ~23-minute Docker image build that previously ran independently in both
Confidence Score: 4/5Safe to merge — the workflow change is well-structured and the fallback path is sound; the only noted issue is a minor docker credential hygiene concern. The build job correctly gates both test jobs via needs, so test jobs never run against a missing image. The three-tier ECR caching logic handles all cache states correctly, and the fallback to local-only build when ECR is unavailable preserves the pre-PR behavior. The one finding worth fixing is the NGC API key being passed as a -p flag to docker login rather than via --password-stdin, which briefly exposes it in the process table on the self-hosted runner. .github/actions/ecr-build-push-pull/action.yml — docker login credential handling on line 66; otherwise the workflow changes in build.yml are straightforward. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[build job] --> B{resolve-ecr}
B -- ECR available --> C{exact commit tag in ECR?}
C -- hit --> D[Pull exact image from ECR]
C -- miss --> E{deps hash in ECR?}
E -- hit --> F[imagetools create alias commit tag]
E -- miss --> G[Full docker buildx build]
G --> H[Push commit tag to ECR]
H --> I[Push deps tag to ECR]
B -- ECR unavailable --> G2[Full local build, no push]
D --> J[build job complete]
F --> J
H --> J
G2 --> J
J --> K[test-isaaclab-tasks]
J --> L[test-general]
K --> K1[ecr-build-push-pull: pulls or rebuilds]
L --> L1[ecr-build-push-pull: pulls or rebuilds]
K1 --> M[run-tests]
L1 --> N[run-tests]
M --> O[combine-results]
N --> O
Reviews (1): Last reviewed commit: "Merge branch 'main' into jichuanh/main-c..." | Re-trigger Greptile |
|
|
||
| if [ -n "${{ env.NGC_API_KEY }}" ]; then | ||
| echo "🔵 Logging into nvcr.io..." | ||
| docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io |
There was a problem hiding this comment.
Passing the NGC API key as a
-p flag exposes the secret value in the process argument list (/proc/<pid>/cmdline) for the duration of the docker login call. Any process on the same self-hosted runner with read access to /proc can capture it. Piping via --password-stdin avoids this entirely.
| docker login -u \$oauthtoken -p ${{ env.NGC_API_KEY }} nvcr.io | |
| echo "${{ env.NGC_API_KEY }}" | docker login -u \$oauthtoken --password-stdin nvcr.io |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| - name: Full build | ||
| if: steps.pull-exact.outputs.hit != 'true' && steps.deps-cache.outputs.deps-cache-hit != 'true' |
There was a problem hiding this comment.
Full build runs in test jobs when ECR unavailable
When resolve-ecr resolves no ECR URL, both pull-exact and deps-cache are skipped (their if conditions gate on resolve-ecr.outputs.available == 'true'). Because neither step ran, their outputs remain unset (not 'true'), so the Full build condition passes and each test job performs an independent full build from scratch — which is the correct fallback, but means the build-once goal silently degrades to build-three-times when ECR is unavailable. The PR description mentions this explicitly, so it is intentional; consider adding a log line in the action to make this fallback observable in the CI log.
Pass the NGC API key to docker login via --password-stdin instead of -p so it never appears in the process argument list on the shared self-hosted runner. Log a clear line in the full-build step when ECR is unavailable, so the degradation from build-once to build-per-job is visible in CI logs.
1. Summary
build.ymlrebuilds the test Docker image independently in bothtest-isaaclab-tasksandtest-general(~23 min each, in parallel, racing the shared GHA layer cache). This builds the same image twice every run.buildjob that pushes it to ECR, then has the two test jobs depend on it and pull the prebuilt image instead of rebuilding.ecr-build-push-pullaction already used bydevelop/releaseon the same self-hosted runner pool; it resolves the ECR URL from the runner's SSMecr-cache-urlparameter and falls back to a local build when ECR is unavailable, so no runner regresses.2. Changes
.github/actions/ecr-build-push-pull/action.yml— generic drop-in fordocker-buildwith ECR-backed caching (ported verbatim fromdevelop). Builds + pushes the commit-tagged image, or pulls it if it already exists in ECR..github/workflows/build.yml:buildjob that builds and pushes the image to ECR once.test-isaaclab-tasksandtest-generalnowneeds: [build]and replace their per-jobBuild Docker Imagestep with an ECR pull of the sameimage-tag.run-tests, result copy/upload, fork-result checks, andcombine-resultsare unchanged. Job graph:build → {test-isaaclab-tasks, test-general} → combine-results.3. Adapted to main (not a develop bulk-copy)
run-tests/combine-results, and the existing image (docker/Dockerfile.curobo) — this only changes how many times the image is built, not what is built.cache-tag: cache-main-curoboso main's layer cache does not collide with develop/beta2's namespace in the shared ECR repo.4. Expected effect
5. Test plan
build.ymland the new action validate as YAML; job dependency graph isbuild → test jobs → combine-results.buildjob pushes once;test-isaaclab-tasks/test-generalpull the prebuilt image (no rebuild). Draft until that is observed green.Note: targets
maindeliberately —developalready uses this build-once/ECR pattern; this brings main'sbuild.ymlin line.